AITopics | zero-shot robustness

Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

Neural Information Processing SystemsMar-22-2026, 02:06:45 GMT

CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module.

artificial intelligence, large language model, natural language, (11 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.36)

Add feedback

A Appendix A.1 UniBench Implementation Details We have developed UniBench

Neural Information Processing SystemsFeb-16-2026, 19:08:51 GMT

To evaluate new VLMs that expand beyond the already implemented 59 VLMs, users need to follow Code Snippet 2. Users would need to create a class that inherent from As described in Section 2.2, LLM-style models defined as models that generate tokens/text as output. Thereby, making them hard to compare with CLIP-style VLMs. Following Matsuura et al. [2023] methodology, we evaluated Llava 1.5 [Liu et al., 2023] - a LLM-style VLM - on various benchmark types in UniBench (Table 2). Scaling improves many benchmarks, but offers little benefit for reasoning and relation. Figure 8: Benchmark capabilities performance does not scale with dataset and model size Median zero-shot performance of models on various benchmark capabilities.

benchmark, large language model, machine learning, (18 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)

Add feedback

A Appendix A.1 UniBench Implementation Details We have developed UniBench

Neural Information Processing SystemsOct-10-2025, 10:21:39 GMT

To evaluate new VLMs that expand beyond the already implemented 59 VLMs, users need to follow Code Snippet 2. Users would need to create a class that inherent from As described in Section 2.2, LLM-style models defined as models that generate tokens/text as output. Thereby, making them hard to compare with CLIP-style VLMs. Following Matsuura et al. [2023] methodology, we evaluated Llava 1.5 [Liu et al., 2023] - a LLM-style VLM - on various benchmark types in UniBench (Table 2). Scaling improves many benchmarks, but offers little benefit for reasoning and relation. Figure 8: Benchmark capabilities performance does not scale with dataset and model size Median zero-shot performance of models on various benchmark capabilities.

benchmark, digit, recognition standard, (14 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.87)

Add feedback

64f1f27bf1b4ec22924fd0acb550c235-Supplemental.pdf

Neural Information Processing SystemsAug-14-2025, 21:17:10 GMT

cityscape, encoder, segformer, (14 more...)

Neural Information Processing Systems

Country:

Asia > China > Jiangsu Province > Nanjing (0.05)
Asia > China > Hong Kong (0.05)

Technology: Information Technology > Artificial Intelligence > Machine Learning (0.69)

Add feedback

Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models

Neural Information Processing SystemsMay-27-2025, 12:43:00 GMT

CLIP), have attracted widespread attention and adoption across various domains. Nonetheless, CLIP has been observed to be susceptible to adversarial examples. Through experimental analysis, we have observed a phenomenon wherein adversarial perturbations induce shifts in text-guided attention. Building upon this observation, we propose a simple yet effective strategy: Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR). This framework incorporates two components: the Attention Refinement module and the Attention-based Model Constraint module.

text-guided attention, vision-language model, zero-shot robustness, (6 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.73)

Add feedback